Robust knowledge-guided biclustering for multi-omics data

Abstract Biclustering is a useful method for simultaneously grouping samples and features and has been applied across various biomedical data types. However, most existing biclustering methods lack the ability to integratively analyze multi-modal data such as multi-omics data such as genome, transcriptome and epigenome. Moreover, the potential of leveraging biological knowledge represented by graphs, which has been demonstrated to be beneficial in various statistical tasks such as variable selection and prediction, remains largely untapped in the context of biclustering. To address both, we propose a novel Bayesian biclustering method called Bayesian graph-guided biclustering (BGB). Specifically, we introduce a new hierarchical sparsity-inducing prior to effectively incorporate biological graph information and establish a unified framework to model multi-view data. We develop an efficient Markov chain Monte Carlo algorithm to conduct posterior sampling and inference. Extensive simulations and real data analysis show that BGB outperforms other popular biclustering methods. Notably, BGB is robust in terms of utilizing biological knowledge and has the capability to reveal biologically meaningful information from heterogeneous multi-modal data.


INTRODUCTION
Biclustering is a popular statistical approach for simultaneously grouping both rows (features) and columns (samples) of a data matrix.By organizing the data into sub-matrices that contain similar elements, biclustering has gained significant traction, particularly in high-dimensional settings where only a limited number of features contribute significantly to differentiating various clusters.A prominent application of biclustering lies in complex disease subtyping, which aims to identify coherent subgroups of individuals based on specific biological features.Numerous studies have demonstrated the capacity of biclustering to unveil crucial biological insights, thus offering interpretable solutions [1,2].
The existing biclustering methods can be categorized into four classes: (i) probabilistic and generative methods including FABIA [3], SSBi [4] and SSLB [5]; (ii) matrix factorization approaches including SSVD [6] and iSSVD [7]; (iii) combinatorial methods including OPSM [8] and ISA [9] and (iv) neural networks including BARTMAP [10] and AD [11].The SSLB used Spike-and-Slab Lasso priors to induce adaptive shrinkage on both factor and loading matrices.However, it is specifically designed for single-view data and lacks the ability to harness crucial information from diverse yet related data, such as multi-omics data.The iSSVD, based on singular value decomposition, is a frequentist method that incorporates stability selection [12] to control Type I error rates when detecting biclusters.While it allows for the analysis of multiview data, it does not accommodate discrete data types, such as read counts in RNA-Seq, single nucleotide polymorphisms (SNPs) and copy number data.The ISA employs the greedy algorithm to identify submatrices with all rows and columns above a prespecified threshold, which can significantly effect the biclustering results and their interpretation.We refer readers to [13] for more detailed reviews of biclustering methods.
However, none of the aforementioned biclustering methods have utilized biological graph knowledge, despite evidence demonstrating the effective improvement in variable selection performance across different statistical tasks [14,15].An example of such biological knowledge is gene regulatory networks, which provide valuable information regarding regulators and their potential targets.Numerous public resources, such as Kyoto Encyclopedia of Genes and Genomes (KEGG; [16]), Gene Ontology (GO; [17]) and Mummichog [18], collect and store biological graph information.A recently proposed probabilistic biclustering method, known as GBC [19], developed hierarchical shrinkage priors to incorporate biological graph information.Although GBC demonstrates promising performance compared with other methods lacking graph incorporation, we found a limitation in their graph-incorporated priors.Specifically, their priors require the inverse covariance matrix of the shrinkage parameters, which imposes the graph information, to be diagonal dominant.This restriction crucially hampers the efficient use of graph information, especially when the underlying network grows larger.
In this article, we propose a novel Bayesian graph-guided biclustering (BGB) method, which comprises three key components.First, the BGB prior formulation effectively addresses the diagonal-dominant issue of GBC, enabling more efficient utilization of graph information.The proposed priors offer explicit control over the correlation level of the shrinkage parameters and adaptability to different prior knowledge.This f lexibility empowers the model to handle scenarios involving larger network sizes or constructed networks containing noise.Second, the BGB, using the data augmentation technique in [20], facilitates a unified framework capable of handling multi-modal inputs that may consist of diverse data types, such as Gaussian, binomial and negative binomial data.Third, the proposed method allows for the retrieval of overlapped biclusters, while a majority of existing biclustering methods [21,22] assume orthogonality among resulting biclusters.This is a crucial capability in applications where biclustering is frequently applied, such as cancer subtyping studies.In these contexts, the same gene might be co-active in multiple subtypes.
The rest of the paper is organized as follows.Section 2 introduces the proposed method.Section 3 presents an efficient Markov chain Monte Carlo (MCMC).In Section 4, we conduct simulation studies to assess the performance of our method in comparison with the existing methods.In Section 5, the performance of the proposed method is evaluated on the real data.We conclude with a brief discussion in Section 6.

Model framework
Suppose X = [(X (1) ) T , . . ., (X (H) ) T ] T is the data matrix consisting of H modalities, where X (h) is related to the hth modality with p h features and n samples.The total number of features is p = H h=1 p h .The proposed method aims to identify submatrices of X that correspond to the biclusters containing similar elements x ji .Formally, a bicluster is defined by subsets of rows (features) and columns (samples) for which the columns exhibit the similar behavior on the corresponding subset of rows and vice versa.Such a submatrix manifests as an outer product wz T of two vectors w and z.Building on this insight, we consider a multiplicative bicluster model with the following structure: where μ relates to the expectation of X through the link functions depending on the data types and 1 is the vector with each entry equal to 1.
R n×L refer to the location vector, factor loading matrix and latent factor matrix, respectively.We follow the convention that the superscript (subscript) refers to the column (row).The number of biclustering L is allowed to be unknown and is learnt through the model selection procedure.We assume that both the factors and loading are sparse.One important property of the proposed model is its incorporation of biological graph information which is denoted as G = P, E , with a set of nodes (features) P and a set of edges E, representing the interactions between these features.We assume that there is no edge across modalities.A matrix is considered compatible with the graph G if its off-diagonal entries exhibit the same zero/nonzero pattern as the adjacency matrix of G. Intuitively, the incorporation of the biological graph information helps reduce false negatives and false positives in variable selection.For example, if the jth feature and kth feature are (not necessarily directly) connected, there is a higher chance that these two features will be selected or not selected simultaneously for a given latent factor.This is ref lected in the factor loading matrix W, where w jl and w kl tend to be either zero or non-zero for a specific 1 ≤ l ≤ L.
The likelihood function is assumed to be conditionally independent as follows: where π j (•) refers to the density function for the jth variable.A number of distributions can be considered as the model inputs including the Gaussian, binomial and negative binomial distribution.For Gaussian data, the likelihood is with the precision ρ ji ≡ ρ j .For binomial data, the likelihood is n j x ji e μ ji x ji with the number of trials n ji ≡ n j and x ji ∈ {0, 1, . . ., n j }.For negative binomial data, the likelihood is r j +x ji −1 x ji e μ ji x ji (1+e μ ji ) r j +x ji with the failure parameter with r ji ≡ r j and the non-negative integer x ji .The probability p ji of binomial density or negative binomial density is connected with μ ji through the logit function.

Prior setting
Hierarchical structure for the factor loadings W We consider the following hierarchical priors for W: where α l = α l − ν 1 1, = η(1 + ) and E = (11 T + I).M G is the set of positive definite and symmetric matrices compatible with the graph G.Each element of W is assigned with a Laplacian prior, where λ jl refers to the individual penalty parameter that helps to achieve the adaptive shrinkage on the factor loadings.We further impose hierarchical priors on λ to incorporate the graph knowledge.To this end, we employ a multivariate Gaussian prior for α l = (α 1l , . . ., α pl ) T for 1 ≤ l ≤ L, where α jl = log λ jl with the precision matrix , who is associated with a novel prior known as the graph-constrained Wishart prior.It is worth noting that such hierarchical prior indicates that the correlations between the shrinkage parameters are governed by the underlying graph knowledge.Specifically, the presence of edges are expected to impact the correlations, leading to a similar shrinkage level for connected features.Hence, these connected features tend to be simultaneously selected or excluded for within a bicluster.The parameters ν 1 , ν 2 , η and can be either user-specified or selected through a tuning procedure.In particular, ν 1 controls the overall a priori shrinkage level, while ν 2 affects the adaptivity to this pre-specified level of shrinkage.When ν 2 → 0, all shrinkage parameters λ jl will concentrated around e ν1 .Both η and are positive and control the adaptivity to the data as well as the level of correlations among the shrinkage parameter, respectively.
The benefits of utilizing such a novel constrained Wishart prior in terms of variable selection are 2-fold compared with other graph-guided priors.First, unlike the MRF priors [23], which suffer from phase transition phenomenon, this novel prior can explicitly control the correlations between the shrinkage parameter through the parameter .It is because the mode of the prior, without the graph constraints, is ( 11 T + I 1+ ) −1 , implying that the offdiagonal entries of −1 are concentrated around 1 1+ .Moreover, the parameter η allows us to control over the level of adaptivity to this prior.Increasing η results in the posterior of being more dependent on the prior and less impacted by the data.Second, our prior allows for a more f lexible matrix space that needs to be compatible with the underlying graph G, as we do not require the diagonal-dominant structure defined in [19], which hinders the efficient utilization of graph information when the size of the pathways in the graph becomes larger.
Hierarchical structure for the latent factor Z We also pursue sparsity in the latent factor matrix to identify biclusters.Similar to the factor loadings W, we place a Laplacian prior for each entry of Z, while the associated individual shrinkage parameter δ li has the gamma distribution The shape parameter ν 4 determines the overall sparsity level of Z with a larger value of ν 4 yielding a sparser Z.The rate parameter ν 3 controls the adaptivity.As ν 3 decreases, the prior densities of δ li tend to have heavier tails, indicating that δ li becomes more adaptive to Z.Both ν 3 and ν 4 need to be pre-specified.
Finally, the prior on the location vector m is the multivariate normal distribution with the prior variance σ 2 m j which is given by

COMPUTATION
In this section, we develop an efficient MCMC algorithm to explore the full posterior distribution (2) Two data augmentation techniques are used to improve the MCMC efficiency.First, the augmentation of the Pólya Gamma latent variable ρ ji [20] for the discrete data x ji facilitates an unified likelihood for the different data types where x j , ρ j and μ j refer to the jth row of X, ρ and μ respectively.The values of ψ ji , κ ji , b ji and the prior density π(ρ j ) vary with the type of x j and are given in Table 1.Suppose , K and ρ are p × n matrices with ψ j ( ψ j ), κ j ( κ j ) and ρ j ( ρ j ) referring to their jth row (column).G(•, •) (PG(•, •)) is the gamma (Pólya gamma) distribution.Second, we augment the Laplacian variable w jl and z li to a scale mixture of normals [24], introducing the augmented variables τ w jl and τ z li , to facilitate the use the Gibbs sampler.We refer Section 2.1 in the Appendix for the full description of the data augmentation.Note that while most parameters can be sampled with the Gibbs sampler, the transformed shrinkage parameter α has to be sampled via the Metropolis-Hastings (MH) algorithm.

Conditional posterior distribution
The conditional distribution for is sampled column by column with the block-wise Gibbs sampler.We first represent , = −1 and A = η(11 ) T , after implicit row and column permutations, in the block forms, respectively, where ω jj , ω −j,j and −j,−j refer to the jth diagonal element, the offdiagonal vector of the jth column ω j and the submatrix of after removing the jth row and column.The block components of and A can be interpreted in the same way.Next, we represent −j,−j and .The superscript 11 (00) implies that the matrix is associated with the subvector of nonzero (zero) entries in ω −j,j , denoted as ω 1 j (ω 0 j ).At the same time, σ 1 j (σ 0 j ) and a 1 j (a 0 j ) ω 1 j (ω 0 j ) in and A are vectors associated with ω 1 j (ω 0 j ).The sampling of ω j can be factorized into the following two parts: where ξ j = ω jj − (ω 1 j ) T −1 j ω 1 j with j = 11 j − 10 j ( 00 j ) −1 01 j and μ j = − j a 1  j /a jj .The notation N (a, b) refers to the normal distribution with mean a and covariance matrix b.The sample of ω jj is obtained from ξ j .It is worth noting that sampling ω 1  j can be inefficient due to the computation of the inverse matrix ( 00 j ) −1 , which is mitigated by introducing .To be specific, we can avoid such inefficient computation if we update with the latest sample of at each iteration due to the relationship The updating strategy of is given by The conditional distribution for ρ The conditional posterior of ρ ji is the Pólya-Gamma (or Gamma) distribution for the discrete (Gaussian) data which is given by The conditional distributions for W and τ w Suppose m j = m j 1 ∈ R n×1 with m j as the jth entry of m.P j is a diagonal matrix with ρ j as its diagonal entries.The conditional distributions for the jth row of W w j and (τ w jl ) 2 are given by Table 1: Likekihood for different data types and formula components of Pólya Gamma classes.x ji , n j and r j refer to the data, the number of trials and the number of failures.G and PG are the gamma and Pólya Gamma distribution where a j = (ZP j Z T + D −1 τ w ) −1 ZP j (ψ j − m j + P −1 j κ j ), B j = (ZP j Z T + D −1 τ w ) −1 with the diagonal matrix D τ w , whose diagonal entries are the jth row of τ , μ jl = λ jl w jl and λ jl = λ 2 jl .IN (μ, λ) refers to the inverse-Gaussian distribution, whose density is defined in [25].

The conditional distribution for α
There is no close form for the conditional posterior of α jl and we use the MH algorithm with the target distribution being where α −j,l = (α 1l , . . ., α j−1,l , α j+1,l , α pl ).In the tth iteration, we generate the candidate sample α (cand)   jl from the proposal distribution N (α (t−1) jl , q), where q is a pre-specified variance and the acceptance
The conditional distribution for Z, τ z and δ Suppose P i is a diagonal matrix with ρ j as its diagonal entries.The conditional distributions for the ith column of Z z i , (τ z li ) 2 and δ are given by where with the diagonal matrix (D τ z ) i and (D δ ) i , whose diagonal entries are the ith columns of τ z and δ and u il =

The conditional distribution for m
The conditional posterior of m is the multivariate normal with each entry being conditionally independent, whose conditional distribution is given by where The overall time complexity of the MCMC algorithm is O( p 2 (p + e) + npL 2 ), where p = max p h for 1 ≤ h ≤ H and e refers to the number of edges.

Tuning
We use a variant of deviance information criteria (DIC) [26] to select the tuning parameters where θ is a subset of potential tuning parameters (ν 1 , ν 2 , ν 3 , ν 4 , , η), l(•, •) refers to the log-likelihood and the mean of log-likelihood , where the superscript t refers to the tth MCMC samples.We conduct grid search for (ν 1 ,ν 3 ,ν 4 , L) and select the model with the smallest DIC.In the selected model, the estimation of the lth bicluster is determined by whether the posterior credible intervals of w jl or z li , with a pre-specified coverage probability, cover zero.Specifically, when the posterior credible interval of w jl or z li covers zero, it implies the jth feature or ith sample is selected by the lth bicluster.In our simulation and real data analysis, we use the 95% posterior credible interval.

SIMULATION
In this section, we compare the performance of BGB to existing biclustering methods in two simulation settings.These methods include (i) GBC [19]; (ii) SSLB [5]; (iii) ISA [9] and (iv) iSSVD [7].Methods (i) and (ii) are Bayesian factor-model-based biclustering methods.iSSVD is a frequentist biclustering method based on the sparse singular value decomposition.The ISA does not assume a model for the data matrix and finds biclusters by selecting rows and columns using a certain threshold.Except for iSSVD, which is implemented in Python, all other methods are implemented in R. In our simulation studies, we fix ν 2 = 0.5 to allow for reasonable prior variance of shrinkage parameters in factor loadings; η = 15 and = 0.1 to impose rational dependence on the graph and prior correlation of the shrinkage parameters.The grid-search area for the other tuning parameters is as follows: [27] is used to determine the MCMC convergence for BGB.All simulation results are based on the 100 repeated samplings.Additional simulation studies can be found in Section 3.3 of Supplementary Materials.
We use the following metrics to measure the similarity between the estimated biclusters and true biclusters: (i) relevance (Rel) and recovery (Rec) [28]; (ii) clustering error (CE) [29] and (iii) consensus score (CS) [3] (for more details, see Section 3 of the Supplementary Material).All of the metrics are constructed using the Jaccard index and range between 0 and 1. Rel measures how well the estimated biclusters represent the true biclusters on average, while Rec instead evaluates how similar the true biclusters are to the estimated biclusters on average.The first step to calculate CS and CA is matching the estimated biclusters and true biclusters by the Hungarian method [30].Then, CS reports the maximum overlapping proportion between two groups of biclusters and CA, which is defined as 1−CE, can be viewed as a weighted version of the CS, scaled by the bicluster size when calculating the overlaping proportions.Therefore, larger biclusters may have a greater impact on CA than CS.
We conducted simulation studies with n = 90 samples, p = 300 features and L = 6 true biclusters in the non-overlapping scenario (Simulation 1) and overlapping scenario (Simulation 2), which is overlapped on the feature-wise with the size 15.In each setting, we further consider four graph structures and two different data types: the Gaussian and binomial data.W is a p × L matrix whose non-zero entry indices in each column range from 1 + 50(l − 1) to 50l in Simulation 1, or from 1 + 48(l − 1) to min{48(l − 1) + 63, 300} in Simulation 2, for 1 ≤ l ≤ 6. Z is an L × n matrix with the same sparsity pattern in both settings.The non-zero entry indices in each row range from 1 + 15(l − 1) to 15l with 1 ≤ l ≤ 6.The non-zero entries of W and Z are generated from N (2, 0.1 2 ) and randomly assigned to be positive or negative with the probability 0.5.For the Gaussian data, we generate X through N (WZ, 4).For the binomial data, we generate X through Binomial(n j , 1 1+e −WZ ), where the number of trial n j is randomly sampled from 10 to 30.
We consider four working graph structures, denoted as G 1 to G 4 , all based on the structure of W. These graphs differ in their informativeness, which is determined by the types of edges they contain.Edges connecting nodes belonging to the same pathway are referred to as within-pathway edges, while edges connecting nodes from different pathways are labeled as acrosspathway edges.Within-pathway edges are expected to provide useful information, whereas across-pathway edges are considered noise.Specifically, G 1 represents the null graph with no edges.G 2 consists of 6 underlying pathways, corresponding to the columns of W, respectively, and contain only within-pathway edges.Each pathway is defined by the non-zero features of a certain column of W, making G 2 the true graph that is compatible with the structure of W. G 3 is constructed by adding across-pathway edges to G 2 with a probability of 0.1.Lastly, G 4 is formed by replacing the withinpathway edges in G 2 with across-pathway edges with a probability of 0.5.For a visual representation of these graphs, please refer to Section 3 of the Supplementary Material.
Table 2 and 3 show that the proposed method (BGB) consistently outperforms the competitors in both simulation settings, according to various metrics.The incorporation of biological graph knowledge has proven to be effective in enhancing the performance of BGB across all cases.Most notably, the performance of BGB under G 3 , although not as strong as under G 2 and G 4 , still surpasses the other methods and is better than under G 1 .This primarily demonstrates the robustness of the proposed method to noisy edges.All of methods, regardless of scenarios, tend to overestimate the number of biclusters L; however, the overestimation of BGB is within a quiet narrow margin compared with SSLB and iSSVD.Additionally, existing methods designed for continuous data yield poor performance on binomial data.

REAL DATA ANALYSIS
Alzheimer's disease (AD) is an irreversible neurodegenerative disease caused by the deterioration of brain tissue, which results in accelerated cognitive decline relative to normal aging.While there exist treatments for enhancing impaired neuron systems, these treatments cannot effectively prevent the progression of cognitive decline.A major obstacle in developing an effective treatment for AD is the incomplete understanding of its pathological mechanism.Hence, it is of critical importance to investigate and comprehend the etiology of AD.
In this article, we conduct the real data analysis that integrated SNP and gene expression data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).The dataset comprises n = 435 subjects from different phases of the ADNI study, including ADNI GO, ADNI2 and ADNI1.Among the samples, there are 39 with AD, 126 with normal cognitive (CN) and 270 with mild cognitive impairment (MCI).This is served as the ground truth when computing the performance metrics.The gene expression data include continuous measurements for 20 093 unique genes.We first shrinkage our attention to the top 10 000 genes with the largest sample variance since AD is believed to be primarily related to a small number of genes.Next, we employ the high dimensional ordinary least-squares projection (HOLP) [31] to reduce the dimensionality from an ultra-high number of features to a more manageable scale.HOLP is a feature screening method possessing the sure screening property, which guarantees that the refined subset of features will preserve the active features of the true model with overwhelming probability.We applied HOLP on the reduced subset to select the top 10% relevant genes, totaling 1000 genes.The response variable used for HOLP is the Clinical Dementia Rating Scale Sum of Boxes (CDR-SB) score, a commonly used measurement to diagnose dementia of AD.The SNP data, viewed as binomial variables taking values from {0, 1, 2}, include the top 935 SNPs related to AD based on existing genome-wide association studies (GWAS) of AD [32].Similarly, we select the top 10% SNPs based on their P-values.The gridsearch area of BGB is as follows: ν 1 ∈ {−5, −2.5, −1, 0, 1, 2.5, 5}, ν 2 ∈ {0.5, 2}, ν 3 {1, 0.1, 0.01, 0.001} and ν 4 ∈ {1, 4, 8, 16, 32}.The maximum number of biclusters is set as 6 for all methods to ensure a fair comparison.
Table 4 reports CS and CE for different methods.Both BGB and GBC are evaluated with (denoted with the superscript * ) or without graph knowledge.The results in Table 4 show that BGB with the graph incorporation compares favorably with existing methods in terms of CS and CA, demonstrating the benefits of using the graph knowledge.GBC, another method that is able to incorporate the graph information, ranks as the second-best method.Furthermore, we conducted pathway enrichment analysis through ToppGene Suite, a one-stop portal for the pathway enrichment analysis, to explore the biological potential of the proposed method.A pathway is said to be enriched in a set of genes if this set of genes is found to be over-represented in the larger set of genes constituting the pathway.With the false discovery rate threshold of 0.05, we found four statistically significant enriched pathways with the genes selected by the proposed method: reactome sensory perception pathway, KEGG olfactory transduction, reactome ligand-receptor interactions pathway, and WP eicosanoid metabolism via cyclooxygenases cox pathway, whose P-values are 7.5 × 10 −5 , 2.2 × 10 −4 , 2.6 × 10 −4 and 2.2 × 10 −4 , respectively.All of these enriched pathways have been demonstrated to be closely related to the AD [33][34][35].

DISCUSSION
In this article, we propose a novel Bayesian biclustering method, aiming at grouping features and samples simultaneously.On the one hand, the proposed method enables the integrative analysis of multi-omics data that may consist of different data types, which allows for a comprehensive understanding of complex biological systems.On the other hand, we develop a novel prior to exploit the underlying biological graph information, leading to improved clustering performance.Extensive simulation studies and real data analysis demonstrate the superior performance and robustness of our proposed method, even in the presence of noisy graphs.The application of BGB extends to disease subtyping studies and other multi-modality learning tasks.There are several potential extensions for future work.First, it would be interesting to extend the model to accommodate model selection procedure in a fully Bayesian framework.This will help determine the optimal model complexity and improve the interpretability of the results.Second, it is worthwhile to explore algorithmic optimizations and leverage parallel computing techniques to enhance the computational efficiency of our method.Furthermore, expanding the scope of our model to address more complicated problems, such as the scenarios with complex graph structures, incomplete multi-omics data or biological data beyond -omics data including imaging data and fMRI functional connectivity data, is another avenue of interest.Finally, motivated by the existing graph-knowledge-guided deep learning models [36][37][38], integrating the proposed graph-incorporated technique to the deep learning structures would be helpful to enhance the interpretability of the deep learning models.

Key Points
• We propose a novel knowledge-guided Bayesian biclustering method that can handle multiple modalities of different data types such as continuous and count data.• A key feature of our method is its ability to incorporate biological graph knowledge effectively and f lexibly.This feature enables our method to yield biologically more interpretable biclustering results that can shed insights into the molecular underpinnings of disease subtypes.• Extensive numerical experiments have demonstrated our method's superiority over several existing methods and its robustness when graphs used are noisy.

Table 2 :
Performance of BGB, GBC, SSLB, iSSVD and ISA for Gaussian and binomial data over replications in Simulation 1 (non-overlapping).Four graph structures G i for i = 1, 2, 3, 4 are considered.Rec/Rel/CS/CA refers to the relevance/recovery/census score/clustering accuracy.L is the estimated number of biclusters.Standard errors are shown in parentheses

Table 3 :
Performance of BGB, GBC, SSLB, iSSVD and ISA for Gaussian and binomial data over replications in Simulation 2 (overlapping).Four graph structures G i for i = 1, 2, 3, 4 are considered.Rec/Rel/CS/CA refers to the relevance/recovery/census score/clustering accuracy.L is the estimated number of biclusters.Standard errors are shown in parentheses

Table 4 :
Performance of different methods on ADNI dataset.Rec/Rel/CS/CA refers to the relevance/recovery/census score/clustering accuracy.The superscript ' * ' refers to 'with the incorporation of graph'